convergence rate
Deep Neural Networks for Doubly Robust Estimation with Nonprobability Survey Samples
Dai, Yufang, Luo, Shihua, Lou, Wendy, Wang, Zilin, Lu, Xuewen
Integrating probability and nonprobability survey samples is an important problem in modern survey sampling. Nonprobability samples often contain rich outcome information but may lack population representativeness, whereas probability samples provide design-based auxiliary information but may not contain the study variable. We propose a deep neural network (DNN)-assisted doubly robust framework for estimating the finite population mean from these two data sources. The proposed method models the logit sampling score for the nonprobability sample as an unknown nonparametric function and estimates it by maximizing a pseudo-likelihood that combines information from the nonprobability sample and a reference probability sample. The DNN parameters are optimized using the ADAM algorithm. The resulting DNN-estimated sampling scores are incorporated into a DNN-assisted inverse-probability weighted estimator and a deep doubly robust estimator. We establish consistency and convergence rates under regularity conditions and evaluate the finite-sample performance of the proposed estimators through simulation studies and an empirical application using Pew Research Center and Behavioral Risk Factor Surveillance System data. The results suggest that the proposed estimators can improve robustness to parametric propensity-score misspecification, especially when the true selection mechanism is nonlinear.
From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
Liang, Yuchen, Shroff, Ness, Liang, Yingbin
Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.
Estimating Mixture Distributions via Stochastic Mirror Descent
Ahmadypour, Mohammadreza, Javidi, Tara, Koushanfar, Farinaz
We revisit the classical problem of estimating an unknown distribution from its samples by fitting a mixture model that minimizes cross-entropy loss. Framing the task as a stochastic convex optimization problem over the space of $M$-component mixture distributions, we propose a family of estimators derived from the stochastic mirror descent (SMD) algorithm. This optimization-based approach provides a principled and flexible framework that generalizes traditional estimators and yields a variety of novel estimators through the choice of Bregman divergence. A key advantage of our method is that it scales efficiently with the number of candidate components $f_i$; that is, one can employ a large set of basis distributions in the mixture model without incurring significant computational overhead. This enables richer approximations and improved estimation accuracy. Moreover, in the case of categorical distributions (discrete outcomes), our estimators do not require a strict lower bound; in other words, our framework does not require precise knowledge of the support of the distribution. We demonstrate that, under mild conditions, the proposed $φ$-SMD estimators achieve near-optimal convergence rates in both Kullback-Leibler (KL) divergence and $\ell_2$-norm and offer practical benefits when computation is expensive. Our numerical analysis highlights improved performance guarantees over classical estimators, particularly in terms of sample efficiency and scalability.
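As a rough illustration of the idea in the abstract above, the following sketch runs stochastic mirror descent with the negative-entropy mirror map (exponentiated-gradient updates) on the weights of a fixed set of candidate densities. The Gaussian components, the step-size schedule, the iterate averaging, and all function names are assumptions made for the example; they are not taken from the paper.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def softmax(theta):
    """Numerically stable softmax: maps log-weights to the simplex."""
    top = max(theta)
    expd = [math.exp(t - top) for t in theta]
    total = sum(expd)
    return [e / total for e in expd]


def smd_mixture_weights(samples, components, base_step=0.2):
    """Entropic stochastic mirror descent on mixture weights (illustrative).

    Minimizes the cross-entropy E[-log sum_i w_i f_i(X)] over the simplex,
    one sample per step, and returns the running average of the iterates.
    """
    m = len(components)
    theta = [0.0] * m              # log-weights; start from uniform weights
    average = [0.0] * m
    for t, x in enumerate(samples, start=1):
        w = softmax(theta)
        dens = [f(x) for f in components]
        mix = sum(wi * di for wi, di in zip(w, dens))
        # The stochastic gradient of -log(mix) w.r.t. w_i is -f_i(x) / mix,
        # so the multiplicative update adds step * f_i(x) / mix in log space.
        step = base_step / math.sqrt(t)
        theta = [th + step * d / mix for th, d in zip(theta, dens)]
        average = [a + (wi - a) / t for a, wi in zip(average, w)]
    return average


def cross_entropy(samples, components, w):
    """Empirical cross-entropy of the mixture with weights w."""
    total = 0.0
    for x in samples:
        total -= math.log(sum(wi * f(x) for wi, f in zip(w, components)))
    return total / len(samples)


def demo(n=4000, seed=0):
    """Draw samples from a hypothetical 3-component mixture and fit weights."""
    rng = random.Random(seed)
    means = [-3.0, 0.0, 3.0]
    true_w = [0.6, 0.1, 0.3]
    components = [lambda x, m=m: gaussian_pdf(x, m) for m in means]
    samples = [rng.gauss(rng.choices(means, weights=true_w)[0], 1.0)
               for _ in range(n)]
    return samples, components, smd_mixture_weights(samples, components)
```

Keeping the state in log-weights makes the update a plain addition followed by a softmax, so the cost per sample is linear in the number of candidate components.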
Boosted Stochastic Frank-Wolfe for Constrained Nonconvex Optimization
Nandhan, Navil, Khademi, Abbas, Silveti-Falls, Antonio
The boosted Frank-Wolfe algorithm accelerates the classical Frank-Wolfe algorithm by better aligning the update direction with the negative gradient. Its analysis, however, has been limited to deterministic convex problems, with step sizes that require either line search or knowledge of the Lipschitz constant of the gradient. We develop a novel step size strategy that does not depend on the Lipschitz constant of the gradient, which allows us to extend the boosted Frank-Wolfe algorithm to the stochastic setting. We prove that boosting with this step size strategy can be combined with many modern gradient estimators, including SAGA, L-SVRG, SAG, Heavy Ball momentum, and zeroth-order estimators, among others, while retaining the worst-case convergence rates of ordinary stochastic Frank-Wolfe. Our analysis also yields the first convergence rates for boosted Frank-Wolfe on nonconvex and quasar-convex objectives; these results are new even for deterministic problems. Experiments on sparse logistic regression and quantum process tomography show that stochastic boosted Frank-Wolfe converges faster per gradient oracle call (and in wall-clock time) than the non-boosted baseline.
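For orientation, the classical (non-boosted, deterministic) Frank-Wolfe iteration that the abstract above builds on can be sketched as follows. The sketch minimizes a simple quadratic over the probability simplex with the standard open-loop step size $2/(t+2)$, which needs neither line search nor a Lipschitz constant. It does not implement the boosting procedure or the paper's step-size strategy, and the objective and function names are assumptions chosen for illustration.

```python
def frank_wolfe_simplex(grad, dim, iterations=1000):
    """Classical Frank-Wolfe over the probability simplex (illustrative)."""
    x = [1.0 / dim] * dim                 # feasible starting point
    for t in range(iterations):
        g = grad(x)
        # Linear minimization oracle: the simplex vertex e_i with smallest g_i.
        i = min(range(dim), key=lambda k: g[k])
        gamma = 2.0 / (t + 2.0)           # open-loop step size
        x = [(1.0 - gamma) * xk for xk in x]
        x[i] += gamma                     # convex combination keeps x feasible
    return x


def quadratic_objective(target):
    """Return f(x) = 0.5 * ||x - target||^2 and its gradient."""
    def f(x):
        return 0.5 * sum((xk - tk) ** 2 for xk, tk in zip(x, target))

    def grad(x):
        return [xk - tk for xk, tk in zip(x, target)]

    return f, grad
```

Each iterate is a convex combination of simplex vertices, so no projection is ever needed; this projection-free property is what Frank-Wolfe variants, boosted or not, are designed to preserve.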
Large Dimensional Kernel Ridge Regression: Extending to Product Kernels
Zhou, Yang, Li, Yicheng, Cheng, Yuqian, Lin, Qian
Recent studies have reported $\textit{saturation effects}$ and $\textit{multiple descent behavior}$ in large dimensional kernel ridge regression (KRR). However, these findings are predominantly derived under restrictive settings, such as inner product kernels on the sphere or strong eigenfunction assumptions like hypercontractivity. Whether such behaviors hold for other kernels remains an open question. In this paper, we establish a broad, new family of large dimensional kernels and derive the corresponding convergence rates of the generalization error. As a result, we recover key phenomena previously associated with inner product kernels on the sphere, including: $i)$ $\textit{minimax optimality}$ when the source condition satisfies $s\le 1$; $ii)$ the $\textit{saturation effect}$ when $s>1$; $iii)$ a $\textit{periodic plateau phenomenon}$ in the convergence rate and a $\textit{multiple-descent behavior}$ with respect to the sample size $n$.
Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach
Kassing, Sebastian, Kruse, Thomas
Stochastic gradient descent (SGD) has been studied extensively over the past decades due to its simplicity and broad applicability in machine learning. In this work, we analyze the local behavior of gradient descent and stochastic gradient descent for minimizing $C^2$-functions that satisfy the Polyak-Łojasiewicz (PL) inequality, under a multiplicative gradient noise model motivated by overparameterized neural networks. Using a geometric interpretation of the PL-condition, we prove a simple yet surprising fact: in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.
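One way to see the flavor of this statement numerically is to run plain gradient descent on a non-convex function that satisfies the PL inequality and compare its late-stage contraction factor with the one gradient descent would have on a quadratic with the same curvature. The test function $f(x, y) = \tfrac{1}{2}(y - \sin x)^2$, the step size, and the starting point below are assumptions chosen for illustration and do not come from the paper; only the deterministic case is shown.

```python
import math


def loss(x, y):
    """f(x, y) = 0.5 * (y - sin x)^2; its minimizers form the curve y = sin x."""
    return 0.5 * (y - math.sin(x)) ** 2


def gradient(x, y):
    """Gradient of f; its squared norm is r^2 * (1 + cos^2 x) >= 2 * f (PL)."""
    r = y - math.sin(x)
    return (-r * math.cos(x), r)


def run_gd(x, y, step=0.5, iterations=25):
    """Plain gradient descent; returns the final point and the loss history."""
    history = [loss(x, y)]
    for _ in range(iterations):
        gx, gy = gradient(x, y)
        x, y = x - step * gx, y - step * gy
        history.append(loss(x, y))
    return (x, y), history


def quadratic_rate(x_limit, step=0.5):
    """Per-step loss contraction of GD on 0.5 * lam * r^2, where lam is the
    nonzero Hessian eigenvalue of f at the limit point."""
    lam = 1.0 + math.cos(x_limit) ** 2
    return (1.0 - step * lam) ** 2
```

Calling `run_gd(1.0, 2.0)` and comparing the ratio of two consecutive late entries of the loss history with `quadratic_rate` evaluated at the final `x` gives a quick check that the two contraction factors agree closely near the set of minimizers.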
Quantitative Local Convergence of Mean-Field Stein Variational Gradient Flow
Chizat, Lénaïc, Colombo, Maria, Colombo, Roberto, Fernández-Real, Xavier
Stein Variational Gradient Descent (SVGD), introduced in [LW16], is a deterministic interacting-particle method for sampling from a target probability measure $\pi \propto e^{-V}$, only requiring access to $\nabla V$. In the mean-field and continuous-time limit, the distribution of particles converges to a flow $(\rho_t)$ in the space of probability measures that solves a variant of the Fokker-Planck equation with a velocity field smoothed by weighted convolution with a positive definite kernel [LLN19]. This flow can be interpreted as the gradient flow of the relative entropy $H(\cdot\,|\,\pi)$ with respect to a "kernelized" Wasserstein metric [Liu17]. The goal of this paper is to investigate the convergence of $(\rho_t)$ towards $\pi$. To this end, we focus on the model case of Riesz kernels of order $s$ on the $d$-dimensional torus $\mathbb{T}^d$. This is a family of translation-invariant kernels whose Fourier coefficients decay as $|\xi|^{-2s}$. The parameter $s$ hence directly controls the "smoothing strength" of the interaction; in particular, continuous kernels correspond to $s > d/2$, $C^1$ kernels to $s > (d+1)/2$, and $C^2$ kernels to $s > (d+2)/2$. What is known: qualitative weak convergence. The starting point of convergence analyses is the energy dissipation formula [Liu17] $\frac{d}{dt} H(\rho_t\,|\,\pi) = -I_s(\rho_t\,|\,\pi)$ (1.1). Authors are listed in alphabetical order.
Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework
In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-γ)^{-1}\log((1-γ)^{-1}ε^{-1}))$ for computing an $ε$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
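A small numerical check can make the "greedy step on averaged $Q$-functions" reading concrete in the simplest setting. For tabular softmax policies, the multiplicative natural-policy-gradient update $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(\eta\, Q^{\pi_k}(s,a))$ started from the uniform policy coincides with a softmax (an entropy-regularized greedy step) applied to $\eta$ times the running sum of past $Q$-functions. The two-state MDP, the step size, and the function names below are assumptions for illustration; the sketch covers only this special case, not DSPI in general.

```python
import math

# Hypothetical 2-state, 2-action MDP: P[s][a] is the next-state distribution,
# R[s][a] the reward. Action 1 moves towards / stays in the rewarding state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [0.0, 1.0]]
GAMMA = 0.9


def evaluate_q(policy, sweeps=2000):
    """Iterative policy evaluation: fixed point of the Bellman operator for Q."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q[s][a] for a in range(2)) for s in range(2)]
        q = [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(2))
              for a in range(2)] for s in range(2)]
    return q


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    expd = [math.exp(z - top) for z in scores]
    total = sum(expd)
    return [e / total for e in expd]


def npg_two_ways(eta=1.0, iterations=30):
    """Run the multiplicative NPG update and, alongside, accumulate Q-functions."""
    policy = [[0.5, 0.5], [0.5, 0.5]]          # uniform initial policy
    q_sum = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iterations):
        q = evaluate_q(policy)
        for s in range(2):
            for a in range(2):
                q_sum[s][a] += q[s][a]
        # Multiplicative form: pi_{k+1} proportional to pi_k * exp(eta * Q).
        new_policy = []
        for s in range(2):
            unnorm = [policy[s][a] * math.exp(eta * q[s][a]) for a in range(2)]
            total = sum(unnorm)
            new_policy.append([u / total for u in unnorm])
        policy = new_policy
    # Averaged form: softmax of eta times the sum of all past Q-functions.
    averaged = [softmax([eta * z for z in q_sum[s]]) for s in range(2)]
    return policy, averaged
```

The two returned policies should agree up to floating-point error, because unrolling the multiplicative update from a uniform start turns the product of exponentials into the exponential of a sum.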
Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift
Liu, Xinyu, Xie, Zixuan, Zhang, Shangtong
Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-η})$ with $η\in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2η})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation-based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.
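To make the setting concrete, the sketch below runs a scalar stochastic approximation recursion $x_{n+1} = x_n + \alpha_n\,(T(x_n) - x_n + \xi_n)$ with a contractive map $T$, a power-law learning rate $\alpha_n = n^{-\eta}$, and noise $\xi_n$ driven by a two-state Markov chain. The map, the noise chain, and all parameter values are assumptions chosen for illustration; the code only simulates the recursion and does not verify the rates stated in the abstract.

```python
import random


def contraction(x):
    """T(x) = 0.5 * x + 1, a 0.5-contraction with fixed point x* = 2."""
    return 0.5 * x + 1.0


def run_stochastic_approximation(steps=20000, eta=0.75, switch_prob=0.3, seed=0):
    """Simulate x_{n+1} = x_n + n^{-eta} * (T(x_n) - x_n + noise_n).

    The noise takes values +1 / -1 and flips sign with probability
    `switch_prob` at each step, so it is Markovian with zero stationary mean.
    """
    rng = random.Random(seed)
    x = 0.0
    noise = 1.0
    errors = []
    for n in range(1, steps + 1):
        alpha = n ** (-eta)                    # power-law learning rate
        x = x + alpha * (contraction(x) - x + noise)
        if rng.random() < switch_prob:         # advance the noise chain
            noise = -noise
        errors.append(abs(x - 2.0))
    return x, errors
```

Plotting `errors` against `n` on a log-log scale for several values of `eta` is one simple way to look at how the learning-rate exponent affects the decay of the error along a single trajectory.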
Statistical Convergence of Spherical First Hitting Diffusion Models
Bienewald, Simon, Trottner, Lukas
Denoising diffusion models have evolved into a state-of-the-art method for tasks in various fields, such as denoising and generation of images, text generation, or generation of synthetic data for training other machine learning models. First hitting diffusion models (FHDM) are a particular class of denoising diffusion models with \textit{random} adaptive generation time tailored to generate data on a known manifold. Building on the conditioning framework of Doob's $h$-transform, these models leverage the given information on the target data manifold to demonstrate strong performance across tasks while offering distinct features such as time-homogeneous dynamics of the generating process and a reduced average simulation time. Even though the theoretical investigation of standard forward-backward diffusion models has attracted much attention in the recent past, the statistical convergence properties of FHDMs are not yet understood. In this work, we show that, up to logarithmic factors, FHDMs achieve the minimax optimal convergence rate in total variation for spherically supported Sobolev smooth data distributions. In particular, this is the first statistical optimality result for denoising diffusion modelling with random generation time.